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GENE-CENTRIC GENE GENE INTERACTION: 
A MODEL-BASED KERNEL MACHINE METHOD 1 

By Shaoyu Li and Yuehua Cui 

Michigan State University and St. Jude Children's Research Hospital 

Much of the natural variation for a complex trait can be ex- 
r£ \ plained by variation in DNA sequence levels. As part of sequence 

variation, gene-gene interaction has been ubiquitously observed in 
nature, where its role in shaping the development of an organism 
has been broadly recognized. The identification of interactions be- 
tween genetic factors has been progressively pursued via statistical 
or machine learning approaches. A large body of currently adopted 
methods, either parametrically or nonparametrically, predominantly 
focus on pairwise single marker interaction analysis. As genes are the 
functional units in living organisms, analysis by focusing on a gene 
as a system could potentially yield more biologically meaningful re- 
sults. In this work, we conceptually propose a gene-centric framework 
for genome-wide gene-gene interaction detection. We treat each gene 
as a testing unit and derive a model-based kernel machine method 
for two-dimensional genome- wide scanning of gene-gene interactions. 
In addition to the biological advantage, our method is statistically 
appealing because it reduces the number of hypotheses tested in 
a genome-wide scan. Extensive simulation studies are conducted to 
evaluate the performance of the method. The utility of the method is 
further demonstrated with applications to two real data sets. Our 
method provides a conceptual framework for the identification of 
gene-gene interactions which could shed novel light on the etiology 
of complex diseases. 

1. Introduction. Accumulative evidence shows that much of the genetic 
variation for a complex trait can be explained by the joint function of multi- 
ple genetic factors as well as environmental contributions. Searching for these 
contributing genetic factors and further characterizing their effect sizes is one 
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of the primary goals and challenges for modern genetics. The recent break- 
throughs in high-throughput genotyping technologies and the completion of 
the International HapMap project provide unprecedented opportunities to 
characterize the genetic machinery of living organisms. Genetic association 
analyses focusing on single nucleotide polymorphisms (SNPs) or haplotypes 
have led to the identification of many novel genetic determinants of complex 
disease traits. However, despite enormous success in genome-wide associ- 
ation studies, single SNP or haplotype based studies still suffer from low 
replication rates because of the infeasibility of dealing with the complex 
patterns of association, for example, genetic heterogeneity, gene-gene inter- 
action and gene-environment interaction. Many of the genetic components 
of many diseases remain unaccounted for and only a small proportion of the 
heritability has been explained. 

It is commonly recognized that genes do not function alone, rather they in- 
teract constantly with each other. Gene-gene interactions have been broadly 
considered as important contributors to the unexplained heritability of com- 
plex traits [Thornton- Wells, Moore and Haines (2004), Maher (2008), Moore 
and Williams (2009), Eichler et al. (2010)]. Current methods for gene- 
gene interactions are mostly focused on single locus interactions, using ei- 
ther parametric methods such as the regression-based tests of interaction 
[Piegorsch, Weinberg and Taylor (1994)] and Bayesian epistasis mapping 
[Zhang and Liu (2007)], nonparametric methods such as the entropy-based 
approaches [Kang et al. (2008)], or data mining methods such as the multi- 
factor dimensionality reduction (MDR) [Ritchie et al. (2001)] and random 
forests [Breiman (2001)]. Methods based on interaction of haplotypes have 
also been developed [e.g., Li et al. (2010), Li, Zhang and Yi (2011)]. Due to 
the issue of haplotype phase-ambiguity, however, haplotype-based interac- 
tion analysis is limited to small sized haplotypes. Extension to large sized 
haplotype interaction is computationally challenging. For a comprehensive 
review of statistical methods developed for detecting gene-gene interactions, 
readers are referred to Cordell (2009). 

A number of research reports have argued the relative merit of gene-based 
association analysis [e.g., Neale and Sham (2004), Jorgenson and Witte 
(2006), Cui et al. (2008), Ma et al. (2010)]. Neale and Sham (2004) argued 
that a gene-based approach, in which all variants within a putative gene are 
considered jointly, have relative advantages over single SNP or haplotype 
analysis. As genes are the functional units in a human genome, variants in 
genes should have high probability of being functionally more important 
than those that occur outside of a gene [Jorgenson and Witte (2006)]. Be- 
cause of this characteristic, gene-based association analysis would provide 
more biologically interpretable results than the single-SNP or haplotype 
based analysis. Moreover, when multiple variants within a gene function in 
a complicated manner, the gene-based association test can gain additional 
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power by capturing the joint function of multiple variants simultaneously 
compared to a single SNP analysis [Cui et al. (2008), Buil et al. (2009)]. 
In addition, a gene-based analysis is statistically appealing. By considering 
multiple SNP markers within a gene as testing units, one can reduce the 
number of tests, hence releasing the multiple testing burden and improving 
association test power. 

The relative advantage of gene-centric analysis motivates us to consider 
genes as modeling units to identify interactions in a gene level. It is our 
expectation that the identification of genetic interactions in a gene level 
should carry the same benefits and gains as it does with gene-based associa- 
tion analysis. We therefore propose to jointly model the genetic variation of 
SNPs within a gene, then further test the interaction in a gene level rather 
than in a single SNP level. This conceptual definition and modeling of gene- 
centric gene-gene (denoted as 3G) interaction would change the traditional 
paradigm of gene-gene interaction analysis and help us gain novel insight 
into the genetic etiology of complex diseases. In addition to its biological 
merits, by focusing on genes as testing units, the number of pairwise inter- 
action tests can be dramatically reduced compared to a single SNP-based 
pairwise interaction analysis. Thus, the 3G interaction analysis is also sta- 
tistically appealing. 

Following the definition of the gene-centric interaction, we propose a model- 
based kernel machine method to identify significant gene-gene interactions 
under the proposed 3G analysis framework. Kernel-based methods have been 
proposed to evaluate association of genetic variants with complex traits in 
the past decades [e.g., Tzeng et al. (2003), Schaid et al. (2005), Wessel and 
Schork (2006), Schaid (2010a, 2010b)]. A general kernel machine method can 
account for complex nonlinear SNP effects within a genetic feature (e.g., 
a gene or a pathway) by using an appropriately selected kernel function. 
Generally speaking, a kernel function captures the pairwise genomic similar- 
ity between individuals for variants within an appropriately defined feature 
[Schaid (2010a)]. The application of kernel-based methods in genetic associ- 
ation analysis has been reported in the literature [e.g., Schaid et al. (2005), 
Kwee et al. (2008), Wu et al. (2010)], but none of them considers gene-gene 
interactions. In this work, we propose a general 3G interaction framework 
by applying the smoothing-spline ANOVA model [Wahba (1990)] to model 
gene-gene interactions. The proposed method, termed Gene-centric Gene- 
Gene interaction with Smoothing-sPline ANOVA Model (3G-SPA), is im- 
plemented through a two-step procedure: (1) an exhaustive two-dimensional 
genome-wide search for any genetic effects; and (2) assessment of significance 
of interactions for the identified gene pairs. 

The rest of the paper is organized as follows. In Section 2 we describe the 
detailed model derivation of our method. We propose two score statistics 
for testing the overall genetic effect and the interaction effect. To evaluate 
the performance of the proposed method, Monte Carlo simulations are per- 
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formed in Section 3. The utility of the method is demonstrated by two real 
data analyses in Section 4, followed by discussion in Section 5. 

2. Statistical methods. 

2.1. Smoothing spline-ANOVA model. We assume n unrelated individu- 
als sampled from a population, each of which possesses a measurement for 
a quantitative disease trait of interest. The quantitative measurements of n 
individuals are denoted as y = (yi, y 2 , ■ ■ ■ , y n ) T ■ Traditional approaches for 
detecting gene-gene interactions, such as MDR or regression type analysis, 
identify SNP-SNP interactions. In this work, we focus our attention to pair- 
wise gene-gene interactions by considering each gene as a unit. Consider two 
genes, denoted as G\ and G2, with L\ and L2 SNP markers, respectively. 
Let Xj = (xi t i, . . . ,Xi t L) be an 1 x L genotype vector of the gene pair for 
subject i, where L = L\ + L2 is the total number of SNP markers in the 
two genes. We model the relationship between the genotypes of the gene 
pair (xj) and the phenotype yi by the following model: 

(2.1) yj = m(Xj)+£i, i = l,2,...,n, 

where m is an unknown function and e% ~ AA(0,<7?) is a random subject- 
specific error term and independent of Xj. Here o~~ (=c 2 ) is generally as- 
sumed to be homogeneous. 

Gu (2002) has discussed the ANOVA decomposition of multivariate func- 
tions on generic domains of each single coordinate. Actually, the decom- 
position can also be defined on nested domains (see Appendix A). Follow- 
ing a similar idea, the genotype vector is partitioned as Xj — [x^ ^ , x| ^ ] , 
where x$ represents the Lj SNP predictors for gene j (j = 1, 2). Let a prod- 
uct domain be X = ® X^ with G X^ and Aj be an averaging op- 
erator on X<j\ that averages out xp\ j = 1, 2. Then a function m(-) defined 
on the product domain has a functional ANOVA decomposition as in the 
following: 

2 

m = J\(I- Aj + Aj)m 
i=i 

(2.2) = {A^ 2 + (/ - Ax)A 2 + A 1 (I- A 2 ) + (I - A^I - A 2 )}m 
= ju + mi+m 2 + mi2, 

where /i is the overall mean, 7714,777-2 are the main effects of the two genes 
and 777-12 describes the interaction effect between them (see Appendix A for 
more details). 

2.2. Reproducing kernel Hilbert space and the dual representation. Based 
on the ANOVA decomposition, a reproducing kernel Hilbert space (RKHS) % 
of functions on X can be constructed [Gu and Wahba (1993) and Wahba 
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et al. (1995)]. Let n ij) be an RKHS of functions on j = 1,2, and 1® 
be a space of constant functions on then 

2 

3=1 

(2.3) = [i] e ® i (2) ] e ® n {2) ] © (n {1) ® ?^ (2) ) 
= [i]©^ 1 ©^ 2 ©^ 3 , 

where © refers to direct sum and © refers to tensor product. Equation (2.3) 
provides an orthogonal decomposition of the entire functional space % . So H. 
is a RKHS with the associated reproducing kernel as the sum of the repro- 
ducing kernels of these component subspaces. Each functional component 
in (2.2) lies in a subspace in (2.3), and is estimated in the corresponding 
RKHS. The identifiability of the components is assured by side conditions: 
IxU) «^( x(i) ) dfj,j = 0, j = 1, 2. 

We assume that function m is a member of the RKHS T~L and can be 
estimated as the minimizer of the following penalized sum of squares: 

n 

(2.4) £(y,m) = ^(y i -m(x i )) 2 + AJ(m), 

i=l 

where </(•) is a roughness penalty. With the orthogonal decomposition of 
space T~L, the penalty function J(-) can be decomposed such that equa- 
tion (2.4) becomes 

n 3 

(2.5) £(y,m) = ~ ™(*i)) 2 + £ \i\\P l m(-)\&, 

i=l 1=1 

where P l is the orthogonal projector in % onto H l , and the Aj's are the tuning 
parameters which balance the goodness of fit and complexity of the model. 
The minimizer of the objective function (2.5) is known to have a representa- 
tion [Wahba (1990), Chapter 10] in terms of a constant and the associated 
reproducing kernels {ki(s,t)} of the H l ,l = 1,2,3, that is, 

n 3 

i=i i=i 

3 

i=i 

where Kf{x.) = (fc/(xi,x), . . . , /cj(x n , x)), C\ = (ci, . . . , c n ) T 6i. Details on the 
choice of the reproducing kernel functions corresponding to the three sub- 
spaces will be discussed in a later section. 



m(x) 



(2.6) 
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Substituting the representation of m(-) into (2.5), we get 

n 3 

C(y,m) = J> 4 - m(x,)) 2 + ^X l \\P l m(-)\\ 2 n 



i=i 



i=i 



(2.7) 



(y - m(X)) T (y - m(X)) + £ A,Cf K,C« 



z=i 



where X = (x^f, . . . , x^) 1 and 



T\T 



K, 



r ^ T (xi) 
^ T (x 2 ) 



The gradients of C with respect to the coefficients (fi, C\ : I = 1, 2, 3) are 



and 



9Ci 



2 Kf| 



Ml-^K^ +A,K,C,|. 



Therefore, the first order condition is satisfied by the system 

: 2 i t k 3 



(21 



n 

Kfl KfKi + AiKi 



l T Ki 1 T K 2 1 T K, 



Kfl 

.Kg*! 



KfK 2 
K^K 2 + A 2 K 2 
KjK 2 



KfK 3 
K^Ks 
K^K 3 + A 3 K 3 









Ci 




c 2 




Lc 3 J 



y- 



The connection between smoothing splines and the linear mixed effects 
model has been previously established [Wahba (1990), Speed (1991)]. For 
the two-way ANOVA decomposition model considered in this paper, we 
show that the first order system above is equivalent to Henderson's normal 
equation of the following linear mixed effects model (see Appendix B for 
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details): 

(2.9) y = fil + mi + m 2 + mi2 + e, 

where mi, 7712,77112 are independent n x 1 vector of random effects; mi ~ 
^(O.rfKi), m 2 ~iV(0,T 2 2 K2), m i2 ~ iV(0, r|K 3 ), and e ~ N(0,a 2 I) is in- 
dependent of mi, 77t2 and mi2- This connection indicates that the estima- 
tors of functions mi,J7i2, m.12 are just the BLUPs of the linear mixed effects 
model [see also Liu, Lin and Ghosh (2007)]. Tuning parameters A;, I = 1, 2, 3, 
are functions of the variance components, which can be estimated either by 
the maximum likelihood method or by the restricted maximum likelihood 
(REML) method. Since the REML method produces unbiased estimators for 
the variance components, we adopt REML estimation in this work. The dual 
representation of the linear mixed effects model obtained for the SS-ANOVA 
model makes it feasible to do inferences about the main and interaction com- 
ponents under the mixed effects model framework. 

2.3. Choice of kernel function for genotype similarity. The choice of a re- 
producing kernel is not arbitrary in the sense that the kernel function must 
be nonnegative definite. By Theorem 2.3 [Gu (2002)], given a nonnegative 
definite function k on X, we can construct a unique RKHS of real- valued 
functions on X with k as its reproducing kernel. In genetic association stud- 
ies, a kernel function captures the pairwise genomic similarities across mul- 
tiple SNPs in a gene. It projects the genotype data from the original space, 
which can be high dimensional and nonlinear, to a one-dimensional linear 
space. The allele matching (AM) kernel is one of the most popularly used 
kernels for measuring genomic similarity. This type of kernel measure has 
been used in linkage analyses [Weeks and Lange (1988)] and in association 
studies [Tzeng et al. (2003), Schaid et al. (2005), Wessel and Schork (2006), 
Kwee et al. (2008), Mukhopadhyay et al. (2010) and Wu et al. (2010)]. For 
a review of genomic similarity and kernel methods, readers are referred to 
Schaid (2010a, 2010b). With the notable strength that it does not require 
knowledge of the risk allele for each SNP, the AM kernel is chosen as the 
kernel function in this study. This similarity kernel counts the number of 
matches among the four comparisons between two genotypes gi yS (with two 
alleles A and B) and gj s (with two alleles C and D) of two individuals i 
and j at locus s, and can be expressed as 

AM(g itS = A/B, g hS = C/D) = I(A = C)+I{A = D) + I{B = C) + I(B = D) , 

where I is the indicator function and "=" means the two alleles are identical- 
by-state (IBS). The kernel function based on this AM similarity measure 
then takes the following form: 

(2-10) fM = ^J^lM , 

where S is the number of SNPs considered for each kernel function. 
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To incorporate valuable SNP-specific information into analyses in order 
to potentially improve performance, a weighted-AM kernel can be applied 
which has the following form: 

(2.H) fi»,»)= ^"d^'*"\ 

where w s is the weighting function which can be adopted to incorporate 
prior knowledge in order to gain extra power. For example, when a study 
is trying to identify the effect of rare variants, the weight function can be 
taken as the inverse of the minor allele frequency to boost the signal for 
rare variants [Schaid (2010b)]. An example illustrating the calculation of 
the kernel matrix is given in Appendix C. 

We use the AM kernel as the reproducing kernel for the two subspaces H 1 
and H 2 corresponding to the main effects. Utilizing the fact that the repro- 
ducing kernel of a tensor product of two reproducing kernel Hilbert spaces is 
the product of the two reproducing kernels [Aronszajn (1950)], the associated 
reproducing kernel for "H 3 can be taken as the product of the reproducing 
kernels of the two subspaces: H 1 and H 2 . 



2.4. Hypothesis testing. 



2.4.1. Testing overall genetic effect. In a gene-based genetic association 
study, one is interested in whether a gene as a system is associated with 
a disease trait. In the proposed 3G interaction study, we are interested in 
the association of each gene with a quantitative trait as well as the inter- 
action between genes, if any. The analysis starts with a two-dimensional 
pairwise search for gene pairs with overall contribution to the phenotypic 
variation and then tests those contributing gene pairs for interaction effect. 
Under the SS-ANOVA framework, testing the overall genetic effect of a gene 
pair is equivalent to testing Hq : mi = rri2 = m\2 = 0. Similarly, testing for 
interaction effect can be formulated as Hq : m\i = 0. With the linear mixed 
effects model representation, the aforementioned two tests are equivalent to 
Hq \ t 2 = Tij = Tg = and Hq-.t 2 = 0, respectively. Here, ,rf ,rf are the 
variance components in model (2.9). 

A well-known issue in variance component analysis is that the parame- 
ters under the null hypotheses are on the boundary of the parameter space. 
Moreover, the kernel matrices K/s are not block-diagonal. Thus, the asymp- 
totic distribution of a likelihood ratio test (LRT) statistic does not follow 
a central chi-square distribution under the null hypothesis. The mixture 
chi-square distribution proposed by Self and Liang (1987) under irregular 
conditions also does not apply in our case. In this paper, we construct score 
test statistics based on the restricted likelihood. Consider the linear mixed 
model in (2.9), y ~ N(fil, V(/3)). The restricted log-likelihood function can 
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be written as 

i R oc -\ ln(|F(/3)|) - \ ln(|l r V^" 1 C9)l|) - \{y - fil) T V(P)-\y - AI), 

where (3 = (a 2 ,t\ ,t^t 2 ) t , V(fi) = a 2 1 + t 2 K x + t 2 K 2 + r|K 3 . The first 
order derivative of the restricted log-likelihood function with respect to each 
variance component is 

(2.12) H = -±tr(w) + \(y- Aif^W^-^Xy - Ai), 

where VJ = = 1, ... ,4, so that Vi =I,V 2 = Ki, V3 = K 2 , V 4 = K 3 and 

R = V~ 1 - V~*1(1 T V~ 1 1)~ 1 1 T V~ 1 . 

The restricted score function under the null hypothesis Hq :t 2 = r| = 
r| = is given by 



dl R 



= ~ tr(P Vi) + ^ (y - Ai) T ^(y - £i), 



1 -'2" 

where Po = I — is the projection matrix under the null. Thus, 

Hq can be tested using the following score statistic: 

1 3 

where Ao = (I ~~ Po)y is the MLE of under the null. This leads to 

3 



S(a 2 ) = ^y T P ^TK l P y. 

a 1=1 

Denoting the true value of a 2 under the null by erg, S(c7q) is a quadratic 
form in y. Following Liu, Lin and Ghosh (2007), we use the Satterthwaite 
method to approximate the distribution of S(<Tq) by a scaled chi-square 
distribution, that is, S(uq) ~ ax 2 ,, where the scale parameter a and the 
degrees of freedom g can be estimated by the method of moments (MOM). 
By equating the mean and variance of the test statistic S^Og) with those 
of ax^, we have 

5 = E[S(a 2 )} = tr (p Q ^ kA jl = E[a X 2 g ] = ag, 
v = Var[5(a 2 )] = tr ( £(P Ki) ^(PqK;) J h = Var^] = 2 a 2 g. 



\i=i i=i / 

Solving for the two equations leads to a = u/25 and g = 2b 2 jv. 

In practice, we do not know the true value <Tq and we usually replace it by 
its MLE under the null model, denoted by a\. The asymptotic distribution 
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of 5 , (o'q) can still be approximated by the scaled chi-square distribution 
because the MLE is \fn consistent. To account for this substitution, we 
estimate a and g by replacing v by v based on the efficient information. The 
elements of the Fisher information matrix of r = (Tf ,t|,t|) are given by 



tr(P KiP Ki) tr(P KiP K 2 ) 
tr(P K 2 P Ki) tr(P K 2 P K 2 ) 
tr(P K3PoK!) tr(P K 3 P K 2 ) 



tr(P KiP K 3 ) 
tr(P K 2 P K 3 ) 
tr(PoK 3 PoK 3 ) 



MMPqKi) tr(P K 2 ) tr(P K 3 )] T 



and I a i a i = g tr(PoPo). Then the efficient information I T 



I T 



T- 1 T 

and v = Var[S'((7 2 )] ~ SUM[/ TT ], where operator "SUM" indicates the sum 
of every element of the matrix. 



2.4.2. Testing for Gx G interaction. For testing the interaction effect, 
that is, testing Hq : t| = 0, we also apply a score test. Let E = a 2 1 + rf Ki + 
t|K 2 . The score function (2.12) under this null hypothesis becomes 



d£ R 



dri 



i[tr(P 01 K 3 ) - (y - /il) T S- 1 K 3 S- 1 (y - £1)] 



y T P 01 K 3 Poiy), 



where Poi = E~ 
the null. Then 



E 1 1(1 T E 1 1) 1 1 T E 1 is the projection matrix under 



Si = iy T P iK 3 P 01 y. 



Similarly, the Satterthwaite method is used to approximate the distri- 
bution of Sj by ciiXgj- Parameters aj and gi are estimated by MOM. 
Specifically, aj = uj/25i and gj = 2Sf/uj, where 5j = 2-tr(PoiK 3 ) and uj = 
itr(P iK 3 P iK 3 ) - i^A- 1 ^, in which 



$ = [tr(P 2 1 K 3 ) tr(P iK 3 PoiKi) tr(P iK 3 P 01 K 2 



and 



A: 



tr(P 2 !) 



tr(P n 2 iK 



tr(P 2 iK! 



tr(P^K 2 ) 

oijvi; tr(PoiKiPoiKi) tr(P iKiP iK 2 ) 
tr(P&K 2 ) tr(P iK 2 P iKi) tr(P iK 2 P 01 K 2 ) 



3. Simulation study. 



3.1. Simulation design. Monte Carlo simulations were conducted to eval- 
uate the performance of the proposed method for detecting overall genetic 
effects as well as interaction between two genes. Genotype data were simu- 
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lated using the MS program developed by Hudson (2002). The MS program 
generates haplotype samples by using the standard coalescent approach in 
which the random genealogy of a sample is first generated and the mutations 
are randomly placed on the genealogy. We first simulated two independent 
samples of haplotypes. Parameters of the coalescent model were set as fol- 
lows: (1) the diploid population size Nq = 10,000; (2) the mutation parame- 
ter 9 = 4iVo/i = 5.610 x 10 -4 /6p; and (3) the cross-over rate parameters were 
p = ANqt = 4.0 x 10 _3 /6p and p = 8 x 10~ s /bp for the two samples. In each 
sample, 100 haplotypes were simulated for a locus with 10 kb long and the 
number of SNP sequences was set to be 100. Two haplotypes were then ran- 
domly drawn within each simulated haplotype pool and paired to form the 
genotype on the locus for an individual. For each individual, we randomly 
selected 10 adjacent SNPs with minor allele frequency (MAF) greater than 
5% to form a gene. This was done separately for each simulated haplotype 
pool. Finally, we had genotypes for n individuals for two separate genes with 
10 SNPs each, and the two genes were independent. 

When simulating phenotypes, four scenarios were considered (Figure 1). 
In scenario I [Figure 1(A)], the genetic effects were all set to zero so that 



■ 200 «500 1000 «n=200 ■ n=500 ■ n=1000 




Fig. 1. The empirical type I error (A) and power (B)-(D) of the three methods, where 
KM, fPCA and pPCA refer to the proposed kernel machine method, the full and partial 
PGA methods, respectively. Different heritability levels (H2) were assumed. The variance 
terms (a 2 ,-rf ,r| ,r| ) corresponding to different heritability levels (H2 = 0.05, 0.1, 0.2) were 
given as in: (B) (0.8, 0.021, 0.021, 0), (0.8, 0.044, 0.044, 0), (0.8, 0.1, 0.1, 0); (C) (0.8, 
0.016, 0.016, 0.088), (0.8, 0.036, 0.036, 0.018), (0.8, 0.08, 0.08, 0.04); and (D) (0.8, 
0.011, 0.011, 0.022), (0.8, 0.022, 0.022, 0.044), (0-8, 0.05, 0.05, 0.1). 
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we could assess the false positive control. In scenario II [Figure 1(B)], we 
considered the main effects for the two genes, but set the interaction effect as 
zero. In scenarios III [Figure 1(C)] and IV [Figure 1(D)], both main effects 
and interaction effect were considered. The difference between scenarios III 
and IV is that the interaction effect in scenario III is smaller than the main 
effect, while in scenario IV it is larger than the main effects. Quantitative 
traits of interest were simulated from a multivariate normal distribution with 
mean /xl nx i an d variance-covariance matrix V = o~ 2 I + t 2 ~Ki+t 2 K.2 + t 2 K.3, 
where t 2 ,t 2 ,t^ took different values under different scenarios; IQ, t =1,2,3, 
are the kernel matrices using the allele matching method described before. 
Different sample sizes (n = 200, 500 and 1000) and different heritability 
(H 2 = 0.05, 0.1, 0.2) were assumed. Let a 2 G = t? + t$ + t$. The heritability 
was defined as H 2 = o~ G /(o~ G + a 2 ). In all simulation scenarios, we fixed the 
residual variance a 2 = 0.8, and t 2 and r| were set to be equal. 

3.2. Model comparison. We compared our simulation results with two 
other methods described in the following. Wang et al. (2009) proposed an in- 
teraction method using a partial least squares approach which was developed 
specifically for binary disease traits. The method cannot be applied for quan- 
titative traits. However, in Wang et al.'s paper they compared their method 
with a regression-based principle component analysis method. Specifically, 
assuming an additive model for each marker in which genotypes AA, Aa 
and aa are coded as 2, 1, 0, respectively, the singular value decomposition 
(SVD) can be applied to both gene matrices. Let Gj = (Si, S2, . . . , 5x .) be an 
n x Lj SNP matrix for gene j (=1,2). The SVD for Gj can be expressed as 
Gj = UjDjVj , where Dj is a diagonal matrix of singular values, and the ele- 
ments of the column vector Uj are the principal components Uj , Uf , . . . , U™ 3 

J J J J 

{m,j < Lj is the rank for Gj ) . An interaction model can be expressed as 



where 7 represents the interaction effect between the first pair of PCs corre- 
sponding to the largest eigenvalues in the two genes. The main effect of each 
gene is modeled through the sum of all single marker effects. For simplicity, 
only one interaction effect between the first PC corresponding to the largest 
eigenvalues in each gene was considered in Wang et al. (2009). We followed 
Wang et al. (2009) and compared the performance of our model with this 
one. 

In principle, one can select PCs for each gene based on the proportion of 
variation explained (say, >85%). Then, pairwise interactions can be consid- 
ered for all selected PCs in model (3.1). Thus, if we replace the main effect 
of each gene in model (3.1) with PCs rather than single SNPs to reduce the 



(3.1) 




h=i 
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model degrees of freedom, model (3.1) then becomes 
Pi P 2 Pi P 2 

(3.2) y = »l+J2 Pm K + E b» U l + E E Tpim t£ . 

Pl=l p 2 =l Pl = lP2 = l 

where E/p^j = 1,2, represents the PCs for gene j, and = 1,2, is cho- 
sen based on the proportion of variation explained by the number of PCs 
in gene j. With this regression model, we considered all possible pairwise 
interactions of the selected PCs. GxG interaction was assessed by testing 
Ho : 7pip 2 = 0' f° r au Pi an d P2- This model was applied by He et al. (2010) 
in their gene-based interaction analysis. 

In addition to the two models above, we also compared our gene-centric 
approach to a simple pairwise SNP interaction model. Details of the compar- 
ison are given in Section 3.3. For a given simulation scenario, 1000 simulation 
runs were conducted. Type I error rates and power were examined at the 
nominal level a = 0.05. 

3.3. Simulation results. 

3.3.1. Comparison of the three methods for the overall genetic test. We 
first evaluated the type I error rate and the power of the three meth- 
ods for testing the overall genetic effects (i.e., Hq : t 2 = r| = r| = 0). Fig- 
ure 1 summarizes the comparison results between our kernel machine (KM) 
method and the partial PC A (pPCA) [model (3.1)] and the full PC A (fPCA) 
[model (3.2)] methods. In Figure 1(A), we can see that our method has the 
empirical type I error rate reasonably controlled for the overall genetic effect 
test. The partial PCA-based interaction model generates very conservative 
results. 

In all simulations we fixed the residual variance a 2 to 0.8, and changed 
the three genetic effects to get different heritability levels. For scenario II 
[Figure 1(B)], data were simulated assuming no interaction (i.e., r| = 0). 
Then t 2 and r| were calculated for a given heritability level. For example, 
when H 2 = 0.05, t 2 = r| = 0.021. In scenarios III and IV, the interaction 
effect was set as either half of the main effects (scenario III) or twice the 
main effects (scenario IV). As we expected, the testing power increases as 
the heritability level and sample size increased [Figure 1(B)-(D)]. For a fixed 
heritability level, the KM method always outperformed the other two under 
different sample sizes. The power difference of the KM method over the other 
two is more striking under a small heritability level (say, 0.05) or when the 
sample size is small. For most complex diseases in humans, the heritability 
of a genetic variant is generally small. Thus, the KM method is preferable 
over the other two in the first stage of the interaction analysis. In addition, 
we also observed that the KM method is insensitive to whether the genetic 
effect is due to the main effect or the interaction, whereas the PCA-based 
method gains power as more of the genetic effect is due to the interaction. 
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3.3.2. Comparison of the three methods for the interaction test. Interac- 
tion may be due to a variety of underlying mechanisms. Some genes might 
have both significant main effects and interaction effect, while others might 
only incur interaction effect without main effects. Simulation studies were 
designed to compare the performance of the proposed KM method in dis- 
covering gene x gene interactions over the other two methods, considering 
different interaction effect sizes. Since the power of an interaction test is 
largely determined by the size of the interaction effect, we simulated data 
assuming different proportions of interaction effects among the total genetic 
variance. This proportion is defined by rj = t 2 /t 2 , where r 2 (=t 2 +r| + t|) 
refers to the total genetic variance. For a fixed total genetic variance, the 
value of rj indicates the strength of the interaction effect between two genes. 
For a fixed residual variance (a 2 = 0.8), the total genetic variance is set to 
0.2 when H 2 = 0.2 and to 0.53 when H 2 = 0.4. The variance size for the 
two main effects were set to be equal, so we could calculate the interac- 
tion variance. For example, (t 2 ,t$,t$) = (0.08, 0.08, 0.04) when rj = 0.2, and 
( r iV 2 Vf) = (0.02,0.02,0.16) when 77 = 0.8 under H 2 = 0.2. Six values of the 
proportion rj = (0, 0.2, 0.4, 0.6, 0.8, 1.0) were considered, including the two ex- 
treme cases: no interaction at all (rj = 0) and pure interaction (rj = 1). The 
method was compared with the other two PCA-based interaction analyses 
under two different sample sizes, 500 and 1000. 

Figure 2 shows power comparison (based on 1000 replicates) under two 
different heritability levels [Figure 2(A) for H 2 = 0.2 and Figure 2(B) for 
H 2 = 0.4]. The type I error (when rj = 0) for the interaction test is reason- 
ably controlled for the three methods. As we expected, the interaction power 
increases as the interaction effect size (rj) increases. Among the three meth- 
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Fig. 2. Power comparison of the proposed KM model (solid line), pPCA model (3.1) 
(dashed line) and fPCA model (3.2) (dotted line) under different sample sizes (n) and 
different interaction sizes (rj), where rj refers to the proportion of interaction effect among 
the total genetic effect. (A) H 2 =0.2 and (B) H 2 = 0.4. 
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ods, our KM method has the highest power. The partial PCA model (3.1) 
has the lowest power. This demonstrates that only considering interaction 
of the first principle component in each gene is not enough to capture the 
interaction between two genes. The effect of sample size on the interaction 
power is also significant. Large sample size always leads to large power. 

A common concern in detecting gene-gene interactions is the computa- 
tional burden. As we focus on genes as testing units, the total number of tests 
reduces dramatically in a gene-centric analysis compared to a single-marker 
based analysis. In addition, computation time can be saved by implementing 
the two-stage analysis: (1) doing score tests in the first stage to assess the 
significance of all genetic components; and (2) testing interaction only for 
those pairs showing statistical significance after multiple testing adjustment 
in the first stage. The computation for the first step is very fast since only 
parameters under the null model, which is a regular linear regression model, 
need to be estimated. In general, only a small fraction of signals can pass 
the significant threshold in the first stage, leaving the second stage inter- 
action test with less computational burden. For a quick comparison of the 
KM method with the PC-based approaches, when n = 500, the computation 
time with 1000 simulation replicates for KM, pPCA and fPCA are 28 min, 
24.5 min and 24 min, respectively, for the overall test. When testing the 
interaction term, the KM method takes about 9sec for a single run on aver- 
age, and is slower than the other two methods. However, this should not be 
a concern in real applications since the number of interactions is generally 
not very large after the first stage screening. 

In summary, our KM method outperforms the other two methods in the 
overall genetic test as well as in the interaction test under different simula- 
tion scenarios. The results also indicate that large sample sizes are needed 
for the detection of the interaction term compared to the detection of main 
effects. Even though the PCA-based analysis has been applied for candi- 
date gene-based association analyses [Wang and Abbott (2008), Wang et al. 
(2009)], our simulation studies show that it may not be suitable for interac- 
tion analysis. 

3.3.3. Comparison with a single SNP interaction model. In a regression- 
based analysis for interaction, the commonly used approach is the single 
SNP interaction model with the form 

(3.3) yi = /3 + P\Su+ P2S21+ (3i2SuS 2 i + £», i = l,2,...,n, 

where j3o is the intercept; j3 2 and fi\ 2 represent the effects of SNP S\ in 
gene 1, SNP S 2 in gene 2 and the interaction effect between the two, respec- 
tively; £j ~ iV(0, a 2 ). We simulated data according to model (3.3) assuming 
a MAF pa = 0.3. Different heritabilities and different sample sizes were as- 
sumed. For simplicity, we assumed the same effect size for the three coef- 
ficients which are calculated under specific heritability (H 2 = 0.2 and 0.4) 



16 S. LI AND Y. CUI 



Table 1 

List of empirical type I error and power based on 1000 simulation runs when data 
were simulated with model (3.3) 
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0.003 






^oo 
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0.057 


0.003 






1000 


0.052 


0.017 


0.059 


0.003 




(0.19, 0.19, 0.19, 0) 


200 


0.497 


0.03 


0.534 


0.032 






500 


0.923 


0.045 


0.911 


0.046 






1000 


0.999 


0.048 


0.997 


0.053 




(0.19, 0.19, 0.19, 0.19) 


200 


1 


0.221 


1 


0.183 






500 


1 


0.419 


1 


0.349 






1000 


1 


0.714 


1 


0.635 


0.4 


(0.51, 0, 0, 0) 


200 


0.053 


0.022 


0.053 


0.003 






500 


0.049 


0.016 


0.062 


0.001 






1000 


0.054 


0.024 


0.057 


0.008 




(0.51, 0.51, 0.51, 0) 


200 


1 


0.051 


1 


0.058 






500 


1 


0.062 


1 


0.067 






1000 


1 


0.054 


1 


0.058 




(0.51, 0.51, 0.51, 0.51) 


200 


1 


0.850 


1 


0.648 






500 


1 


0.996 


1 


0.964 






1000 


1 


1 


1 


1 



P and Pi refer to the power for testing the overall genetic effects (i.e., Hq:t( = r| = 
rf = for the kernel approach and Hq : fi\ = /?2 = /3i2 = for the pairwise SNP interaction 
analysis) and for testing interaction effect (i.e., _Ho:rf = for the kernel approach and 
Hq : P12 = for the pairwise SNP interaction analysis) , respectively. 



when generating the data. We considered an extreme case in which each 
gene only contains one single SNP. Data generated with model (3.3) are 
subject to both the single SNP interaction and the proposed kernel inter- 
action analysis. With this simulation we tried to assess how robust the KM 
method is when there is only one pair of functional SNPs in two genes. 

Table 1 shows that both models show comparable type I error control 
for the overall genetic test (see P Q in the table). For the interaction test, it 
looks like the kernel approach generates more conservative results. Here the 
interaction test is nested within the overall genetic test. If we aggregate the 
results by dividing Pj by P Q , the single SNP analysis actually produces more 
inflated false positives compared to our kernel approach when no genetic 
effect is involved. When data were simulated assuming only main effects but 
no interaction (case /3i2 = 0), the two approaches yield very similar false 
positive rates, indicating reasonable performance of the kernel approach for 
false positive control. 
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For the power analysis, we found little difference between the two methods 
for the overall genetic test (P ), especially under large sample size and high 
heritability level. For the interaction test (Pi), the power increases as sample 
size and heritability level increase. We observed relatively large power differ- 
ences between the models when the sample size is small (<500). As sample 
size increases, the difference diminishes. For example, the power difference is 
0.2 under n = 200 and H 2 = 0.4, which is reduced to 0.032 when n increases 
to 500 and to zero when n = 1000. When more than one functional SNP 
within each gene is involved in interacting with one another to affect a trait 
variation, the kernel method consistently outperformed the single SNP inter- 
action model (data not shown). In summary, our model performs reasonably 
well in different scenarios compared to the other methods. Even when there 
is only one single SNP pair interacting with each other in two genes, our 
analysis produces results as good as the ones analyzed with the true model, 
especially under large sample sizes and high heritability (Table 1). 

4. Applications to real data. 

4.1. Application I: Analysis of birth weight data. A candidate gene study 
was initially conducted in order to study genetic effects associated with be- 
ing large for gestational age (LGA) and small for gestational age (SGA). 
Subjects were recruited through the Department of Obstetrics and Gyne- 
cology at Sotero del Rio Hospital in Puente Alto, Chile, and SNPs were 
selected for genotyping in order to capture at least 90% of the haplotypic 
diversity of each gene. Individuals were genotyped at 797 SNP markers in 
186 unique candidate genes. Missing genotypes were imputed using a condi- 
tional probability approach as described in Cui et al. (2008). We combined 
the two data sets (LGA and SGA) and used baby's birth weight (in kg) as 
the response variable to assess if there are any genes or interaction of genes 
that could explain the normal variation of new born baby's birth weight. 
The case/control classification in the two data sets was based on baby's 
birth weight together with mother's gestational age. A total number of 1511 
individuals in the case and control combined data set were used for analy- 
sis after removing 14 individuals with birth weight 3 x IQR (inter-quartile 
range) above Q3 or below Q\. After removing these extreme observations 
and a Box-Cox transformation, the distribution of the birth weight data in 
the combined sample was approximately normal. 

A two-dimensional pairwise G x G interaction search was conducted (total 
17,205 gene pairs). The score test for testing Hq :rf = r| = r| = was deter- 
mined and p-values were obtained for all gene pairs. For a two-dimensional 
search, it is not clear how to set up a genome-wide threshold to correct for 
multiple testings. Obviously, the 17,205 tests are not independent and it is 
too stringent to use the Bonferroni correction method. Thus, we used an 
arbitrary threshold of 0.001 as a cutoff. Totally, 23 gene pairs were found 
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Table 2 

List of gene pairs with p-value less than 0. 001 in the overall genetic effect test. Gene 
pairs with significant interaction effect (p-valuei <0.05) are indicated with bold font 
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p-valuei 
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0.0340 
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3232 
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0.2410 




F3 


0056 


1.5E-06 


0.0061 


0.3242 


000781 
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GP1BA 


0.0118 


0362 


<10~ 6 


3229 


000283 


0.7462 




IGF1 


0.0124 


0.0090 


<RT 6 


0.3234 


0.000259 


0.5133 




IL1B 


0.0118 


0.0049 


<KT e 


0.3227 


0.000554 


0.6228 




IL9 
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0.0151 


0.0129 


0.3210 


000066 


0.3849 
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MMP7 


0.0122 


0.0051 


<10~ 6 


0.3236 


0.000518 


0.6548 




OXTR 


0.0007 


1.1E-06 


0.0124 


0.3218 


0.000447 


0.2107 




PLAUR 


0.0128 


0.0396 


<10" 6 


0.3194 


0.000536 


0.5460 




PTGER3 


0.0057 


1.7E-07 


0.0051 


0.3240 


0.000279 


0.0434 




PTGS2 


0.0127 


0.0058 


<KT e 


0.3226 


0.000514 


0.7092 




TIMP2 


0.0075 


LIE 07 


0.0043 


0.3238 


0.000916 


0.2351 




TLR4 


0.0122 


0.0115 


<10~ 6 


0.3239 


0.000581 


0.5606 


PTGS2 


ANG 


0.0062 


0.0182 


<10~ 6 


0.3218 


0.000416 


0.7016 




EDN1 


0.0054 


0.0030 


<KT 6 


0.3243 


0.000730 


0.7966 




LPA 


0.0055 


0.0062 


<kt 6 


0.3239 


0.000988 


0.5328 




PDGFB 


0.0010 


1.5E 06 


0.0042 


0.3246 


0.000782 


0.2739 




PGF 


0.0261 


7.4E-08 


0.0044 


0.3240 


0.000850 


0.0062 




PLAU 


0.0004 


3.0E-06 


0.0057 


0.3231 


0.000260 


0.0207 


IL9 
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0.0082 


0.0274 


0.3220 


0.000936 
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0.0174 


2.8E 08 


0.0075 


0.3227 


0.000540 
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a 1 is the residual variance; p-value is for testing Ho :t 2 = t% =t's =Q and p- valuer is for 
testing Ho :r| = 0. Note that the estimated interaction effect size does not necessarily 
indicate how strong an interaction is. The strength of an interaction is rather reflected by 
the interaction testing p-value (denoted by p-value^). 



to be significant with this cutoff. A detailed list of these gene pairs, their 
effect estimates and the p- values for the overall genetic and interaction test 
are shown in Table 2. Among the 23 gene pairs, five significant G x G inter- 
actions were detected at the 0.05 level. These are gene pairs ANG-EDN1, 
PDGFC-PTGER3, PTGS2-PGF, PTGS2-PLAU and IL9-IGF1 (indicated 
by bold fonts in Table 2). If we increase the overall testing p-value threshold 
(to 0.0005), only 9 gene pairs will remain in the list and only 2 gene pairs 
will show significance. 

The results indicate a strong genetic effect for gene PDGFC (Platelet- 
derived growth factor C). This gene is a key component of the PDGFR-a 
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signaling pathway. Studies have shown that PDGFC contributes to normal 
development of the heart, ear, central nervous system (CNS) and kidney 
[Reigstad, Varhaug and Lillehaug (2005)]. Even though its main effect is 
very strong, the only gene found to have significant interaction effect with 
this gene was gene PTGER3 (p- value* = 0.0434). 

Among the five interacting gene pairs, the interaction between genes ANG 
(Angiogenin) and EDN1 (Endothelin 1) shows the strongest interaction sig- 
nal (p-valuej < 10 -5 ). Study has shown that dysregulation of angiopoietins 
is associated with low birth weight [Silver et al. (2010)]. Nezar et al. (2009) 
studied the role of endothelin 1 in preeclampsia and nonpreeclampsia women, 
and found that EDN1 correlates with the degree of fetal growth restriction. 
Although no study has reported the interaction between the two genes, our 
finding suggests a potential role of interaction between the two genes in af- 
fecting fetal growth. Further functional analysis is needed to validate this 
result. 

Interactions were also found between gene PTGS2 (Prostaglandin-endo- 
peroxide synthase 2) and genes PLAU (Urokinase-type plasminogen acti- 
vator) and PGF (Placental growth factor), and between gene IL9 (Inter- 
leukin 9) and IGF1 (Insulin-like growth factor 1). It has been recognized 
that genes PGF and IGF1 are associated with fetal growth [Torry et al. 
(2003), Osorio et al. (1996)]. The identification of interactions of the two 
genes with other genes provides important biological hypotheses for further 
lab verification. 

4.2. Application II: Analysis of yeast eQTL data. The second data set 
we analyzed with our model is a well studied yeast eQTL mapping data set 
generated to understand the genetic architecture of gene expression [Brem 
and Kruglyak (2005)]. The data were generated from 112 meiotic recombi- 
nant progenies of two yeast strains: BY4716 (BY: a laboratory strain) and 
RMll-la (RM: a natural isolate). The data set contains 6229 gene expres- 
sion traits and 2956 SNP marker genotype profiles. As an example to show 
the utility of our approach to an eQTL mapping study, we picked the ex- 
pression profile of one gene (BAT2) as the quantitative response to identify 
potential genes or gene-gene interactions that regulate the expression of this 
gene. Note that one of the parental strains, RMll-la, is a LEU2 gene knock- 
out strain. Thus, we expect strong segregation of this gene in the mapping 
population. Since BAT2 is in the downstream of the Leucine Biosynthesis 
Pathway (LEU2 is in the upstream of the same pathway) , this motivates us 
to pick BAT2 as the response [see Figure 5(a) in Sun, Yuan and Li (2008)]. 
The Bonferroni correction was applied to adjust multiple testings for the 
1,072,380 gene pairs. An overall test for pairs of gene effects was conducted 
followed by the score test for interaction if the overall test is significant. 

There were 1465 genes with some containing a single SNP marker. All the 
genes were subject to the proposed kernel interaction analysis. Figure 3(A) 



20 



S. LI AND Y. CUI 



-loglO(p) 




Fig. 3. Analysis of the expression profile for BATS. The —log 10 transformed p-value 
profile plot of all gene pairs for the overall test (A) and the interaction test (B). The 
yellow hyperplane in (A) represents the Bonferroni cutoff. 



shows the pairwise interaction plot for — log 10 transformed p- values associ- 
ated with the overall genetic test. The yellow hyperplane indicates the Bon- 
ferroni correction threshold. Data points with p- values larger than 10 -4 were 
masked. The plot indicates a strong genetic effect at chromosomes 3 and 13, 
which implies that the two locations are potential regulation hotspots. In 
checking the recent literature, we found that the two positions were reported 
as eQTL hotspots in a number of studies [e.g., Brem et al. (2002), Perlstein 
et al. (2007), Li, Lu and Cui (2010)]. 

Out of the 1072380 gene pairs, 87 pairs were found to have significant in- 
teractions at an experiment-wise significant level of 0.05. Figure 3(B) plots 
the pairwise significant interactions. Circles correspond to significant in- 
teraction pairs, with the darkness of the color indicating the strength of 
the interaction. We saw a strong interaction pattern on chromosome 13. 
One or several genes at this location interact with many other genes to af- 
fect the transcription of gene BAT2. Another interaction "hotspot" is at 
chromosome 3 where genes (containing LUE2 and its neighborhood genes) 
interact with genes at chromosomes 5, 13 and 15 to regulate BAT2 expres- 
sion. We used Cytospace [Shannon et al. (2003)] to generate an interaction 
network map (see Figure 4). Each node represents a gene and the thick- 
ness of the connection line indicates the strength of the interaction effect. 
Genes at the same chromosome location are clustered together in the plot. 
Light nodes with oval shapes indicate weak or no main effect. We found 
strong main effects for genes on chromosomes 3 and 13. The strongest in- 
teraction effect is between genes on chromosome 3 and chromosome 13. We 
also highlighted (red lines) the interaction between genes on chromosome 3 
and others. Among the genes with no main effects (light oval nodes), URA3 
is known as a transcription factor [Roy, Exinger and Losson (1990)]. Even 
though it does not show any main effect, it shows interactions with several 
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Fig. 4. The network graph of interacting genes for the eQTL analysis. Each node rep- 
resents a gene; nodes with light oval shape indicate no main effect. Interacting genes are 
connected with lines (the thicker the lines, the stronger the interaction). Genes with dif- 
ferent colors are located in different chromosomes with significant main effects. 



other genes on chromosome 3 to regulate the expression of BAT2. The re- 
sults also imply the important role of several loci on chromosome 13. As 
their functions are unknown, they could be potential candidate genes for 
further lab validation. 

5. Discussion. The importance of gene-gene interaction in complex traits 
has stimulated enormous discussion. Fundamental works in statistical meth- 
odology development have been broadly pursued in this area [reviewed in 
Cordell (2009)]. Previous investigations have demonstrated the importance 
of a gene-centric approach in genetic association studies by simultaneously 
considering all markers in a gene to boost association power [e.g., Cui et al. 
(2008), Buil et al. (2009), Li et al. (2009), Ma et al. (2010)]. This motivates 
us to develop a gene-centric approach to understand gene-gene interactions 
associated with complex diseases. In this paper, we propose a kernel ma- 
chine method for a gene-centric gene-gene interaction analysis. We adopt 
a smoothing spline-ANOVA decomposition method to decompose the ge- 
netic effects of two genes into separate main and interaction effects, and 
further model and test the genetic effects in the reproducing kernel Hilbert 
space. The joint variation of SNP variants within a gene is captured by 
a properly defined kernel function, which enables one to model the interac- 
tion of two genes in a linear reproducing Hilbert space by a cross-product 
of two kernel functions. The kernel machine method is shown mathemati- 
cally to be equivalent to a linear mixed effects model. Thus, testing main 
and interaction effects can be done by testing the significance of different 
variance components. Extensive simulations and analysis of two real data 
sets demonstrate the utility of the gene-centric interaction analysis. 
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He et al. (2010) previously proposed a gene-based interaction method in 
which each gene is summarized by several principle components and inter- 
action was tested through the modeling of the PC terms rather than sin- 
gle SNPs. The authors proposed a weighted genotype scoring method using 
pairwise LD information to test gene-gene interaction. Their method is sim- 
ilar to several other methods which jointly consider information contributed 
by multiple markers [e.g., Chatterjee et al. (2006), Chapman and Clayton 
(2007)]. Our method is fundamentally different from their approach in which 
we capture the joint variation of SNP variants within and between genes by 
kernel functions [see Schaid (2010a) for more discussion of the advantage of 
the kernel methods] . Our method can also be extended to test interaction of 
variants by incorporating various weighting functions to define a kernel mea- 
sure [Wu et al. (2011)], and will be considered in our follow-up investigation. 
It is worth mentioning some other methods for a gene-based analysis such 
as the CTGDR method proposed by Ma et al. (2010) which accounts for the 
gene and SNP-within-gene hierarchical structure. The method can identify 
predictive genes as well as predictive SNPs within identified genes. Other 
methods such as the grouped lasso method can also be applied for a gene- 
based analysis [Ma, Song and Huang (2007)]. However, their extension to 
a gene interaction analysis is not straightforward and warrants further in- 
vestigation. 

The advantage of the gene-centric gene-gene interaction analysis was pre- 
viously discussed in He et al. (2010), including reducing the number of hy- 
pothesis tests in a genome-wide scan compared to a single SNP interaction 
analysis. However, we should not over-emphasize the role of gene-centric 
analysis. Our simulation study shows that when the underlying truth is 
a single SNP pair interaction in two genes, single-SNP interaction analysis 
performs better. This result agrees with the conclusion made by He et al. 
(2010). Therefore, we recommend that investigators conduct both types of 
analyses (single SNP and gene-centric) in real applications, especially when 
no prior knowledge is available on how SNPs function within a gene as well 
as between genes. For a large-scale genome-wide or candidate gene study, 
one can also use the gene-centric approach as a screening tool, then further 
target which SNPs in different genes interact with each other. 

The choice of kernel function may have potential effects on the testing 
power [Schaid (2010a, 2010b)]. In this paper, we consider the AM kernel. 
Other kernel functions can also be applied as discussed in detail in Schaid 
(2010b). It is not the purpose of this paper to compare the performance of 
difference kernel choices on the power of an association test. A comparison 
study of different kernel functions on the power of the interaction test will 
be considered in future investigations. 

The proposed method considers genes as testing units to test their inter- 
action. It is easy to extend the idea to incorporate other genomic features 
such as pathways as testing units to assess pathway-pathway interaction un- 
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der the proposed framework. The mapping results can then be visualized by 
some network graphical tools such as the Cytospace software [Shannon et al. 
(2003)] which can help investigators generate important biological hypothe- 
ses for further lab validation. The computational code written in R is avail- 
able as an R-package from http : //cran. r-project . org/package=SPA3G. 

APPENDIX A: ANOVA DECOMPOSITION 

Suppose a function / is on domain r = T\ <^T2^T^. Define corresponding 
averaging operator on each generic domain T 7 ,7 = 1,2,3. An ANOVA 
decomposition of a function / can be obtained: 



/=|lI( J -A + A)|/ 



.7= 

= {(/ - A 1 )(I - A 2 )(I - A 3 ) + (I — Ax)(I - A 2 )A 3 

+ (/ - A 1 )A 2 {I - A 3 ) + A 1 (I- A 2 )(I - A 3 ) 

+ (/ - A X )A 2 A 3 + A 1 A 2 (I - A 3 ) +A 1 (I- A 2 )A 3 + A±A 2 A 3 }f. 

For a nested domain (ri ® T 2 ) (g) T 3 , let A± 2 be the averaging operator on 
domain (ri ®T 2 ). Then the ANOVA decomposition becomes 

/ = {(/- A 12 )(I - A 3 ) + A 12 (I - A 3 ) + (/ - A 12 )A 3 + A 12 A 3 }f. 

Since 

(I - Ai)(J - A 2 ) + (J - Ai)A 2 + A 1 (I-A 2 )=I- A X A 2 
by letting A X2 = A X A 2 , 

{(I - A l2 ){I - A 3 ) + A 12 (I - A 3 ) + (/ - A 12 )A 3 + A 12 A 3 }f 



Uil-Ay + AyAf. 

.7=1 J 



Recursively, it shows that the ANOVA decomposition can also be conducted 
on products of nested domains. 

APPENDIX B: THE DUAL REPRESENTATION 

Consider the linear mixed effect model: 

y = /til + mi + m 2 + mi 2 + e, 

where mi,m 2 ,mi 2 are independent nxl vector of random effects; mi ~ 
N(0,t?Ki), m2~iV(0,TfK 2 ), m u ~ N(0, r|K 3 ), and e ~ N(0, a 2 1) is in- 
dependent of mi,m 2 and mi 2 . Henderson's normal equation for obtaining 
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the BLUPs of the random effects is 



(B.l) 
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(J 
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-mu- 




. I . 



It can be shown that this normal equation is equivalent to the first order 
condition for estimating function m, equation (2.8). Multiplying both sides 
of (2.8) by the matrix, 
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Letting m e = K e C e {I = 1,2), m 12 = K 3 C 3 and r| = a 2 /X e (I = 1, 2, 3), the 
system is exactly equation (B.l), which is Henderson's normal equation of 
linear mixed effects model (2.9). 

APPENDIX C: CALCULATION OF THE KERNEL MATRIX: 

AN EXAMPLE 

As an example, we consider three unrelated individuals. Suppose there 
are 10 SNP markers in one gene. The genotypes for these markers are nu- 
merically coded as 0, 1 or 2 depending on the copy number of a certain allele 
(e.g., A). So genotypes aa, Aa and AA are coded as 0, 1 and 2, respectively. 
See Table 4 for the numerical genotype coding of the three individuals at 
the ten marker position. 

At a certain SNP position, the allele matching score of the genotypes for 
two individuals i and j (denoted as AMy), is calculated as the total number 
of alleles which are identical by state (see Table 3). Then the genomic simi- 
larity score across the whole gene between two individuals i and j (denoted 
as GSSjj) is then calculated as the weighted sum of allele matching scores 
for all SNP markers in that gene (see Table 4). 

With the genomic similarity measurements between all possible individual 
pairs [in the example, (1,2), (1,3) and (2,3)], the 3x3 kernel matrix based 
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Table 3 

Calculation of the allele matching score 



9i,s 




9j,s 




AA(2) 


Aa(l) 


aa(0) 


AA(2) 


4 


2 





Aa(l) 


2 


2 


2 


aa(0) 





2 


4 



Table 4 

Illustration on the calculation of genomic similarity score 



SNPi SNP 2 SNP 3 SNP 4 SNP 5 SNP 6 SNP 7 SNP 8 SNP 9 SNPio 



Individual 1 


2 





2 




1 


1 





1 


1 


1 


1 


Individual 2 

























1 








Individual 3 













1 


1 





1 





1 


1 


AM12 





4 







2 


2 


4 


2 


2 


2 


2 


GSS12 






(0 4-4 + 


0- 


-24 


2 + 4 + 2 + 2 + 2 + 2)/(4 x 


10) 


= 0.5 




AM13 





4 







2 


2 


4 


2 


2 


2 


2 


GSS13 






(0 + 4 + 


0- 


-24 


24-4 + 2 + 2 + 2 + 2)/(4 x 


10) 


= 0.5 




AM 23 


4 


4 


4 




2 


2 


4 


2 


2 


2 


2 


GSS23 






(4 4-4 4 


4- 


-24 


2 + 4 + 2 + 2 + 2 + 2)/(4 x 


10) 


= 0.7 





on the example data is given as 



K 



1 

0.5 
0.5 



0.5 
1 

0.7 



0.5 
0.7 
1 
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